Introduction

We found the Forest Covertype dataset in the UCI Machine Learning Repository, which contains forestry data from the Roosevelt National Forest in northern Colorado. The observations are taken from 30 m by 30 m patches of forest, each classified as one of seven forest cover types:

  1. Spruce/Fir
  2. Lodgepole Pine
  3. Ponderosa Pine
  4. Cottonwood/Willow
  5. Aspen
  6. Douglas-fir
  7. Krummholz

The actual forest cover type for a given observation (30 m by 30 m cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Kaggle hosted the dataset in a competition with a training set of 15,120 observations and a test set of 565,892 observations. The relative sizes of the training and test sets make classification of cover type a challenging problem.

Data Exploration

| Name | Measurement | Description |
| --- | --- | --- |
| Elevation | meters | Elevation in meters |
| Aspect | degrees azimuth | Aspect in degrees azimuth |
| Slope | degrees | Slope in degrees |
| Horizontal Distance To Hydrology | meters | Horizontal distance to nearest surface water features |
| Vertical Distance To Hydrology | meters | Vertical distance to nearest surface water features |
| Horizontal Distance To Roadways | meters | Horizontal distance to nearest roadway |
| Hillshade 9am | 0 to 255 index | Hillshade index at 9am, summer solstice |
| Hillshade Noon | 0 to 255 index | Hillshade index at noon, summer solstice |
| Hillshade 3pm | 0 to 255 index | Hillshade index at 3pm, summer solstice |
| Horizontal Distance To Fire Points | meters | Horizontal distance to nearest wildfire ignition points |
| Wilderness Area (4 binary columns) | 0 (absence) or 1 (presence) | Wilderness area designation |
| Soil Type (40 binary columns) | 0 (absence) or 1 (presence) | Soil type designation |
| Cover Type | classes 1 to 7 | Forest cover type designation (response variable) |

Some class separation is clearly visible in the following plots of elevation.

Another compelling variable is aspect: the compass direction in which the slope has its steepest downward gradient. For example, in the rose diagram below, there are more Douglas-fir trees for observations with northern aspects (near 0º) than southern aspects (near 180º).

One of the clearer relationships in the dataset was the distribution of Hillshade luminance. Hillshade is measured on a scale from 0 to 255 (dark to bright).

Hillshade at time t is proportional to: \[\cos(slope)\cos (90- Altitude) + \sin (slope)\sin (90-Altitude)\cos(Azimuth-Aspect)\] where Altitude is the angle of the Sun above the horizon and Azimuth is the compass direction of the Sun, measured clockwise from north; an azimuth of 90 degrees corresponds to east.
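A minimal numeric sketch of this formula (the function and argument names are ours, not from the dataset documentation):

```python
import math

def hillshade(slope_deg, aspect_deg, sun_altitude_deg, sun_azimuth_deg):
    """Relative illumination of a terrain patch, scaled to the 0-255 index.

    All angles in degrees; azimuth and aspect are measured clockwise
    from north, altitude is the Sun's angle above the horizon.
    """
    slope = math.radians(slope_deg)
    zenith = math.radians(90.0 - sun_altitude_deg)  # Sun's angle from vertical
    rel = math.radians(sun_azimuth_deg - aspect_deg)
    # cos(slope)cos(90 - Altitude) + sin(slope)sin(90 - Altitude)cos(Azimuth - Aspect)
    illum = math.cos(slope) * math.cos(zenith) \
        + math.sin(slope) * math.sin(zenith) * math.cos(rel)
    return 255.0 * max(illum, 0.0)  # patches facing away from the Sun clamp to 0

# Flat ground under an overhead Sun is fully lit:
print(hillshade(slope_deg=0, aspect_deg=0, sun_altitude_deg=90, sun_azimuth_deg=180))  # 255.0
```

A 45º slope facing the Sun head-on also scores 255, while the same slope facing directly away scores 0, which is the class-separating behavior the hillshade columns capture.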

This equation arises from a theorem in spherical geometry known as the spherical law of cosines, which relates the sides and angles of triangles drawn on the surface of a sphere.

On a unit sphere, the lengths a, b, c of a triangle's sides equal the angles subtended by those sides at the center of the sphere. If we know two sides a, b and the included angle C, then the cosine of the third side c is given by:

\[\cos(c) = \cos(a)\cos(b) + \sin(a)\sin(b)\cos(C)\]

Setting \(a = slope\), \(b = 90 - Altitude\), and \(C = Azimuth - Aspect\) recovers the hillshade expression above.

In short, the illumination of a patch depends on the altitude of the Sun, the slope of the terrain, and the relative direction of the Sun and the slope.
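As a sanity check on the geometry, one can verify numerically that the hillshade expression equals the cosine of the angle between the Sun direction and the surface normal (the vector conventions below are our own illustrative choice):

```python
import math
import random

def direction(azimuth_deg, altitude_deg):
    """Unit vector for a compass azimuth and altitude (x = east, y = north, z = up)."""
    az, alt = math.radians(azimuth_deg), math.radians(altitude_deg)
    return (math.sin(az) * math.cos(alt), math.cos(az) * math.cos(alt), math.sin(alt))

random.seed(42)
for _ in range(1000):
    slope = random.uniform(0, 90)
    aspect = random.uniform(0, 360)
    sun_alt = random.uniform(0, 90)
    sun_az = random.uniform(0, 360)

    # The hillshade expression from the text, with zenith = 90 - Altitude:
    s, z = math.radians(slope), math.radians(90 - sun_alt)
    formula = (math.cos(s) * math.cos(z)
               + math.sin(s) * math.sin(z) * math.cos(math.radians(sun_az - aspect)))

    # Cosine of the angle between the Sun vector and the surface normal.
    # The normal of a tilted plane leans downhill (toward the aspect) and
    # sits `slope` degrees from vertical, i.e. at altitude 90 - slope.
    sun = direction(sun_az, sun_alt)
    normal = direction(aspect, 90 - slope)
    cos_angle = sum(u * v for u, v in zip(sun, normal))
    assert abs(formula - cos_angle) < 1e-9
```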

\(Hillshade(t_1,t_2,t_3)\) has been plotted below in 3 dimensions:

More importantly, in the context of class separation, the following plot of elevation, slope, and hillshade clearly separates the forest cover types, suggesting that elevation may be the most significant factor in determining cover type.

Cartographic data of this kind is collected primarily for terrain mapping; it is ultimately useful for applying topographic correction to satellite images in remote sensing, or as background information in scientific studies.

Topographic correction is necessary if, for example, we wish to identify materials on the Earth’s surface by deriving empirical spectral signatures, or to compare images taken at different times with different Sun and satellite positions and angles. By applying the corrections, it is possible to transform satellite-derived reflectance into the true reflectivity or radiance that would be observed under horizontal conditions.

Modeling

Logistic Regression

We first investigated multinomial logistic regression. Since classical logistic regression assumes that the predictors are independent, we made sure to run our logistic regressions with regularization. Given a feature vector x, the probability that the response belongs to class k of 1 through 7 is the following:

\[\mbox{Pr}(G=k|X=x)=\frac{e^{\beta_{0k}+\beta_k^Tx}}{\sum_{\ell=1}^7e^{\beta_{0\ell}+\beta_\ell^Tx}}\]
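In code, these class probabilities are a softmax over the seven linear scores; the coefficients below are arbitrary illustrative values, not fitted ones:

```python
import math

def class_probabilities(x, intercepts, coefs):
    """Pr(G = k | X = x) for each class k: softmax of beta_0k + beta_k^T x."""
    scores = [b0 + sum(b * xj for b, xj in zip(beta, x))
              for b0, beta in zip(intercepts, coefs)]
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setting: 2 predictors, 7 cover-type classes, made-up coefficients.
x = [0.5, -1.2]
intercepts = [0.1 * k for k in range(7)]
coefs = [[0.3 * k, -0.2 * k] for k in range(7)]
probs = class_probabilities(x, intercepts, coefs)
print(round(sum(probs), 10))  # 1.0 -- the seven probabilities sum to one
```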

This likelihood enters the elastic-net penalized negative log-likelihood function (the first term of the following equation):

\[\ell(\{\beta_{0k},\beta_{k}\}_1^7) = -\left[\frac{1}{N} \sum_{i=1}^N \Big(\sum_{k=1}^7y_{ik} (\beta_{0k} + x_i^T \beta_k)- \log \big(\sum_{k=1}^7 e^{\beta_{0k}+x_i^T \beta_k}\big)\Big)\right] +\lambda \left[ (1-\alpha)||\beta||_F^2/2 + \alpha\sum_{j=1}^p||\beta_j||\right]\]

where k indexes the response class (cover type 1 to 7), N is the number of observations, and \(y_{ik}\) indicates whether observation i belongs to class k. The second term in the equation is the regularization penalty.
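A direct, unoptimized transcription of this objective (our own illustrative code, not the fitting routine we used) makes the two terms concrete:

```python
import math

def penalized_nll(X, Y, intercepts, coefs, lam, alpha):
    """Elastic-net penalized negative log-likelihood for multinomial regression.

    X: N observations of p predictors; Y: N one-hot rows over K classes;
    intercepts: beta_0k for each class; coefs: K x p matrix of beta_k rows.
    """
    N, K, p = len(X), len(intercepts), len(X[0])
    nll = 0.0
    for xi, yi in zip(X, Y):
        scores = [intercepts[k] + sum(coefs[k][j] * xi[j] for j in range(p))
                  for k in range(K)]
        log_norm = math.log(sum(math.exp(s) for s in scores))
        nll -= sum(yi[k] * scores[k] for k in range(K)) - log_norm
    nll /= N

    frob_sq = sum(b * b for row in coefs for b in row)              # ||beta||_F^2
    group = sum(math.sqrt(sum(coefs[k][j] ** 2 for k in range(K)))
                for j in range(p))                                  # sum_j ||beta_j||
    return nll + lam * ((1 - alpha) * frob_sq / 2 + alpha * group)
```

With all coefficients zero the penalty vanishes and each observation contributes \(\log K\) to the loss; \(\alpha\) then moves the penalty between the ridge (\(\alpha = 0\)) and lasso (\(\alpha = 1\)) extremes.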

The elastic-net penalty \(\alpha\) varies from 0 to 1. When \(\alpha = 0\), the penalty reduces to ridge regularization; when \(\alpha = 1\), it reduces to the lasso. The ridge penalty shrinks the coefficients of closely correlated predictors toward each other, whereas the lasso tends to pick one and discard the others.

Ridge Regularization

We performed 10-fold cross-validation for ridge regularization over a grid of 100 values of \(\lambda\), choosing the \(\lambda\) that minimized cross-validation error for each model. When we ran the regression with a 70-30 validation split of the training set, we got an accuracy on the Kaggle test set of 0.5570 (1,489th place).

When we reran the ridge regression with an 85-15 split of the training data, we got a Kaggle accuracy of 0.5956 (1,414th place). Since the test set is so much larger than our training set, it was beneficial to use a larger proportion of the training data for learning before running the model on the much larger test data.

Lasso Regularization

A 10-fold cross-validated lasso-regularized logistic regression earned a Kaggle score of 0.59594 (1,411th place) with a 70-30 split and 0.59526 (1,415th place) with an 85-15 split. Taking the lowest 50th percentile of \(\lambda\) values and predicting with the mode of their predictions gave a Kaggle score of 0.59443.
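The "mode of the predictions" step amounts to a column-wise majority vote over the per-\(\lambda\) prediction vectors; a sketch with invented predictions:

```python
from collections import Counter

def majority_vote(predictions_per_lambda):
    """One predicted label per observation: the most common label across
    the prediction vectors produced by the different lambda values."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*predictions_per_lambda)]

# Three lambda values, four observations (labels are cover types 1-7):
preds = [
    [1, 2, 5, 7],
    [1, 3, 5, 7],
    [2, 3, 5, 6],
]
print(majority_vote(preds))  # [1, 3, 5, 7]
```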

Elastic-net Regularization

Using both the ridge and lasso regularization terms concurrently produces an elastic-net regularization. We ran this hybrid method with both a 70-30 split and 85-15 split in the training set. When running these models on the testing set, we got accuracies on Kaggle of 0.5959 (1,412th place) and 0.5952 (1,415th place), respectively.

We also tried an elastic-net regression using all 15,120 observations in the training set. It earned an accuracy score on the testing set of 0.5950 (1,417th place).

Random Forest

Boosting Using XGBoost

Boosting Using H2O’s Gradient Boosting Algorithm

Extremely Randomized Trees

Artificial Neural Networks

Single Hidden Layer Using the nnet Package

Deep Learning using H2O

Support Vector Machines

Ensembling

Results Summary

| Model | Kaggle Accuracy |
| --- | --- |
| Logistic Regression - Ridge | 0.5956 |
| Logistic Regression - Lasso | 0.5959 |
| Logistic Regression - Elastic-net | 0.5959 |

References: